
Exploring Architectures, Data and Units For Streaming End-to-End Speech Recognition with RNN-Transducer

Abstract

We investigate training end-to-end speech recognition models with the recurrent neural network transducer (RNN-T): a streaming, all-neural, sequence-to-sequence architecture which jointly learns acoustic and language model components from transcribed acoustic data. We explore various model architectures and demonstrate how the model can be improved further if additional text or pronunciation data are available. The model consists of an 'encoder', which is initialized from a connectionist temporal classification-based (CTC) acoustic model, and a 'decoder', which is partially initialized from a recurrent neural network language model trained on text data alone. The entire neural network is trained with the RNN-T loss and directly outputs the recognized transcript as a sequence of graphemes, thus performing end-to-end speech recognition. We find that performance can be improved further through the use of sub-word units ('wordpieces'), which capture longer context and significantly reduce substitution errors. The best RNN-T system, a twelve-layer LSTM encoder with a two-layer LSTM decoder trained with 30,000 wordpieces as output targets, achieves a word error rate of 8.5% on voice-search and 5.2% on voice-dictation tasks, and is comparable to a state-of-the-art baseline at 8.3% on voice-search and 5.4% on voice-dictation.
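
As a rough illustration of the architecture the abstract describes, the PyTorch sketch below wires an LSTM encoder, an LSTM prediction network (the 'decoder'), and a joint network into an RNN-T model. The class name RNNTransducer, the layer counts, and all sizes are illustrative assumptions, far smaller than the paper's best configuration (a twelve-layer encoder and two-layer decoder over 30,000 wordpieces); it is a minimal sketch, not the authors' implementation.

import torch
import torch.nn as nn

class RNNTransducer(nn.Module):
    """Minimal RNN-T sketch: encoder + prediction network + joint network."""

    def __init__(self, num_features=80, vocab_size=76, hidden=640, blank=0):
        super().__init__()
        self.blank = blank
        # 'Encoder': plays the role of the acoustic model; the paper
        # initializes it from a pre-trained CTC acoustic model.
        self.encoder = nn.LSTM(num_features, hidden, num_layers=2, batch_first=True)
        # 'Decoder' (prediction network): conditions on previously emitted
        # labels, like an RNN language model; the paper partially
        # initializes it from an RNN-LM trained on text data alone.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.LSTM(hidden, hidden, num_layers=1, batch_first=True)
        # Joint network scores the next output symbol (including blank)
        # for every (time frame, label position) pair.
        self.joint = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.Tanh(), nn.Linear(hidden, vocab_size)
        )

    def forward(self, feats, labels):
        # feats: (B, T, num_features) acoustic frames
        # labels: (B, U) previously emitted label ids (blank-prepended)
        enc, _ = self.encoder(feats)                # (B, T, H)
        dec, _ = self.decoder(self.embed(labels))   # (B, U, H)
        t, u = enc.size(1), dec.size(1)
        # Broadcast to a (B, T, U, 2H) grid and score every cell.
        grid = torch.cat(
            [enc.unsqueeze(2).expand(-1, -1, u, -1),
             dec.unsqueeze(1).expand(-1, t, -1, -1)], dim=-1)
        return self.joint(grid)                     # (B, T, U, vocab_size)

# Illustrative usage with random inputs:
model = RNNTransducer()
feats = torch.randn(2, 100, 80)               # 2 utterances, 100 frames each
labels = torch.zeros(2, 5, dtype=torch.long)  # previous labels, blank-prepended
logits = model(feats, labels)                 # (2, 100, 5, 76)
# During training, the RNN-T loss would be applied to these logits
# (e.g. torchaudio.functional.rnnt_loss in recent torchaudio releases).
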